Beware of the complete separation problem

Imagine your logistic regression model perfectly predicted the outcome: every individual positive for the outcome had a predicted probability of 1.0, and every individual negative for the outcome had a predicted probability of 0. This situation is called perfect separation or complete separation, and the trouble it causes is called the perfect predictor problem. It's a nasty and surprisingly common problem that's unique to logistic regression, and it highlights a sad irony: a logistic regression model will fail to converge in the software precisely when it fits the data perfectly!
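
To see this failure for yourself, here's a minimal Python sketch, assuming you have numpy and statsmodels installed. The toy data set is made up for illustration; the predictor completely separates the 0s from the 1s, and the fit either errors out or reports absurd estimates, depending on your statsmodels version:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical toy data: x completely separates the outcome.
# Every x <= 4 is a "no" (0) and every x >= 5 is a "yes" (1).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X = sm.add_constant(x)  # add an intercept column
try:
    result = sm.Logit(y, X).fit()
    # If the fit "finishes," the slope and its SE are absurdly large.
    print(result.params)
    print(result.bse)
except Exception as err:
    # Depending on the statsmodels version, the perfect fit is
    # reported as a PerfectSeparationError or a convergence failure.
    print("Fit failed:", err)
```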

If the predictor variable or variables in your model completely separate the yes outcomes from the no outcomes, the maximum likelihood method tries to make the coefficient of that variable infinite, which usually causes an error in the software. If the coefficient is positive, the OR is driven toward infinity; if it's negative, the OR is driven toward 0. The SE of the OR is driven toward infinity, too, which may leave your CI with a lower limit of 0, an upper limit of infinity, or both.
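
You can see the arithmetic of that blow-up with a couple of made-up numbers. Suppose the software limps to a stop with a coefficient of about 40 and an SE of about 3,000 (hypothetical values of the kind separation produces); a quick Python calculation shows what happens to the OR and its 95 percent CI:

```python
import numpy as np

b, se = 40.0, 3000.0  # hypothetical blown-up estimates from a separated fit
with np.errstate(over="ignore"):  # silence the expected overflow warning
    or_hat = np.exp(b)            # the OR: roughly 2.4e17, absurdly large
    ci = np.exp([b - 1.96 * se, b + 1.96 * se])
print(or_hat, ci)  # the huge OR, and a CI of [0., inf]
```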

Check out Figure 18-8, which illustrates the problem. The regression tries to make the curve come as close as possible to all the data points. Usually it has to strike a compromise, because there's a mixture of 1s and 0s, especially in the middle of the data. But with perfectly separated data, no compromise is necessary. As b becomes infinitely large, the logistic function morphs into a step function that touches all the data points (look at the curve for b = 5).
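
If you'd rather see the morphing in numbers than in a figure, here's a small Python sketch. The intercept a and the slopes b are arbitrary choices for illustration, not values taken from Figure 18-8:

```python
import numpy as np

def logistic(x, a, b):
    # The logistic curve: predicted probability of a "yes" at x.
    return 1.0 / (1.0 + np.exp(-(a + b * x)))

x = np.linspace(-2, 2, 9)  # a few points straddling the separation
for b in (1, 5, 50):
    # As b grows, values snap toward 0 on one side and 1 on the other,
    # so the smooth curve turns into a step function.
    print(f"b = {b:>2}:", np.round(logistic(x, 0.0, b), 3))
```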

While it's relatively easy to spot a single perfect predictor in your data set by looking at frequencies (see the cross-tab sketch after this paragraph), you may also run into the perfect predictor problem through a combination of predictors in your model. Unfortunately, there aren't any great solutions. One proposed solution, called the Firth correction, adds a small amount of information, roughly equivalent to half an observation, to the data set, which disrupts the complete separation. If your software can apply this correction, it will produce output, but the results will likely be unstable (very near 0, or very near infinity). Trying to fix the model by changing the predictors doesn't make sense, because the model already fits perfectly. You may be forced to abandon your logistic regression plans and instead provide a descriptive analysis.
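
As a concrete illustration of the frequency check mentioned above, here's a sketch using pandas. The data frame and its column names are hypothetical; the point is that a zero cell in the cross-tab flags a perfect predictor:

```python
import pandas as pd

# Hypothetical data: 'exposed' turns out to predict 'outcome' perfectly.
df = pd.DataFrame({
    "exposed": [0, 0, 0, 0, 1, 1, 1, 1],
    "outcome": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Zero cells in this table mean one predictor level contains only
# one outcome, which is the signature of a perfect predictor.
print(pd.crosstab(df["exposed"], df["outcome"]))
```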